Dual-level Collaborative Transformer for Image Captioning

Authors

Abstract

Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for their lack of contextual information and fine-grained details, which contrast with the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-way Self Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr on the Karpathy split and 135.4% CIDEr on the official split.
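The core idea of the locality-constrained cross attention described above can be illustrated with a minimal sketch: region features act as queries over grid features, but an alignment mask (standing in for the paper's geometric alignment graph) restricts each region to the grid cells it spatially overlaps. This is a simplified single-head, numpy-only illustration under assumed shapes and a hand-built mask, not the paper's implementation; the function name and toy dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def locality_constrained_cross_attention(regions, grids, align_mask):
    """Fuse region features (queries) with grid features (keys/values),
    attending only to grid cells the alignment mask links to each region."""
    d = regions.shape[-1]
    scores = regions @ grids.T / np.sqrt(d)       # (n_regions, n_grids)
    scores = np.where(align_mask, scores, -1e9)   # block unaligned grid cells
    attn = softmax(scores, axis=-1)
    return attn @ grids                           # (n_regions, d)

# toy example: 2 regions, 4 grid cells, 8-dim features
rng = np.random.default_rng(0)
regions = rng.standard_normal((2, 8))
grids = rng.standard_normal((4, 8))
# assumed alignment: region 0 overlaps grid cells 0-1, region 1 overlaps cells 2-3
align_mask = np.array([[True, True, False, False],
                       [False, False, True, True]])
fused = locality_constrained_cross_attention(regions, grids, align_mask)
```

Because masked scores are driven to a large negative value before the softmax, unaligned grid cells receive zero attention weight, so each fused region vector depends only on its aligned grid cells.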


Related Papers

Contrastive Learning for Image Captioning

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich fine-grained descriptions. On the other hand, a multi-stage image captioning model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which...

Phrase-based Image Captioning

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...

Domain-Specific Image Captioning

We present a data-driven framework for image caption generation which incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, where many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously-written descriptions from a database and adapt them to new q...

Convolutional Image Captioning

Image captioning is an important but challenging task, applicable to virtual assistants, editing tools, image indexing, and support of the disabled. Its challenges are due to the variability and ambiguity of possible image descriptions. In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units. Despite ...


Journal

Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i3.16328